Air pollutants are a mix of harmful particles and gases in the atmosphere that result in air pollution. High, unhealthy concentrations of air pollutants lead to disease, disrupted household and industrial production, and uninhabitable ecosystems. This research focuses on air quality in Beijing, China, and seeks possible ways to improve it by analyzing the factors that affect pollutant levels. Beijing experienced relatively severe pollution in the 2010s because of rapid economic development and heavy traffic. This research uses a time series of daily observations from 12 monitoring sites in Beijing between 2013 and 2017, and we use regression models both to draw insights from the data and to build a prediction model for pollution levels.
In this analysis, we first clean the data, including computing the daily mean and maximum of each pollutant, so that we can run an exploratory analysis and plot how air pollution differs across monitoring stations. We also examine seasonal patterns and trends to gain deeper insight into the time series. Next, we build models such as gradient boosting regression to predict future pollution levels. We split the data into training, validation, and test sets, tune hyperparameters such as tree depth on the training set, and use performance measures such as RMSE to select the best model. Finally, we report the selected model's predictive accuracy on the test set.
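The chronological 60/20/20 split described above can be sketched on a toy array (the sizes here are hypothetical, not the real series):

```python
import numpy as np

# stand-in for the daily observations, ordered in time
n = 10
X = np.arange(n)

# no shuffling: earlier days train, later days validate and test
train_end, dev_end = int(0.6 * n), int(0.8 * n)
X_train, X_dev, X_test = X[:train_end], X[train_end:dev_end], X[dev_end:]
print(len(X_train), len(X_dev), len(X_test))  # 6 2 2
```

Keeping the split chronological matters for time-series data: shuffling before splitting would leak future information into the training set.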
Our results will show whether human sources, such as industry and transport, and natural factors, such as time of day or weather, have significant effects on air quality. They will also reveal which specific factor has the most impact on pollutant concentrations at each station. We also want to compare pollution levels between urban and rural sites; for example, CO may hit urban sites harder because of exhaust emissions. To do so, we need to import external data, such as traffic and industrial activity around each monitoring site, to identify the external factors that influence pollution levels and to put forward useful recommendations for policymakers.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
1.1 Group files
1.2 Create new variable 'Date'
1.3 Managing missing data
1.4 Indexing, selection, and filtering
1.5 Merging and reshaping data
We have 12 datasets, one per monitoring station. The following code combines the 12 datasets into one dataframe, adds a new variable "Date", and processes null values.
#Read all files
path = "/Users/Shirley/Desktop/758XPython/project/PRSA_Data_20130301-20170228/"
stations = ['Aotizhongxin', 'Changping', 'Dingling', 'Dongsi', 'Guanyuan', 'Gucheng',
            'Huairou', 'Nongzhanguan', 'Shunyi', 'Tiantan', 'Wanliu', 'Wanshouxigong']
frames = [pd.read_csv(path + 'PRSA_Data_' + s + '_20130301-20170228.csv') for s in stations]
#Concatenate all files
df = pd.concat(frames)
df.drop(['No'], axis=1, inplace=True)
First, we create a new column called 'Date' that concatenates the columns 'year', 'month', and 'day', because we want to analyze the data by day. We cast these three columns to strings and then join them.
#Create new variable 'Date'
df[['year', 'month', 'day']] = df[['year', 'month', 'day']].astype(str)
df["Date"] = df[['month', 'day', 'year']].agg('/'.join, axis=1)
def func_time(x):
    return datetime.strptime(x, '%m/%d/%Y')
df['Date'] = df['Date'].apply(func_time)
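A more direct alternative (shown on a toy frame) is to pass the three integer columns straight to pd.to_datetime, which accepts a dataframe with 'year', 'month', and 'day' columns, with no string joins needed:

```python
import pandas as pd

toy = pd.DataFrame({'year': [2013, 2013], 'month': [3, 3], 'day': [1, 2]})
# pd.to_datetime assembles a datetime column from year/month/day columns directly
toy['Date'] = pd.to_datetime(toy[['year', 'month', 'day']])
print(toy['Date'].dt.strftime('%m/%d/%Y').tolist())  # ['03/01/2013', '03/02/2013']
```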
We observed that some variables have null values, so we fill them with the mean of the column for the same station and day. Variables that cannot be filled with a mean are dropped.
#process null value
#fill null values with the mean for that station on that day
df=df.fillna(df.groupby(['station', 'Date']).transform('mean'))
#some station-days are null for all 24 hours, and 'wd' has dtype object,
#so these cannot be filled with a mean value; we drop the remaining nulls
df.dropna(inplace=True)
df.isnull().sum()
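To see what the group-then-transform fill does, here is the same pattern on a toy frame with hypothetical readings: each NaN is replaced by the mean of the non-missing values for the same station and day.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'station': ['A', 'A', 'B', 'B'],
    'Date': ['2013-03-01'] * 4,
    'PM25': [10.0, np.nan, 20.0, np.nan],  # hypothetical hourly readings
})
# transform('mean') returns a frame aligned with the original rows,
# holding each group's mean, which fillna then uses to patch the NaNs
filled = toy.fillna(toy.groupby(['station', 'Date']).transform('mean'))
print(filled['PM25'].tolist())  # [10.0, 10.0, 20.0, 20.0]
```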
Next, we group the data by date and station to get each pollutant's daily mean and maximum.
#Add suffix for each pollutant
ddf2=df.groupby(['Date', 'station']).mean()[['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']].add_suffix('Mean')
ddf2
ddf3=df.groupby(['Date', 'station']).max()[['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']].add_suffix('Max')
ddf3
After computing the daily mean and max of each pollutant, we merge the two dataframes.
ddf4=pd.merge(ddf2, ddf3, on=["Date", 'station'])
ddf4
Then we add the wind direction ('wd') to the dataframe. Since 'wd' may take several values within a day, we use the daily mode.
ddf5=df.groupby(['Date', 'station'])['wd'].agg(lambda x:x.value_counts().index[0])
ddf6=pd.merge(ddf4, ddf5, on=['Date', 'station'])
ddf6
ddf6['wd']
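The lambda used for 'wd' picks the most frequent value because value_counts sorts by count in descending order; a toy check with hypothetical wind directions:

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['d1'] * 3, 'station': ['A'] * 3,
                    'wd': ['NE', 'NE', 'SW']})  # 'NE' occurs twice, 'SW' once
# value_counts().index[0] is the most frequent value in each group
mode_wd = toy.groupby(['Date', 'station'])['wd'].agg(lambda x: x.value_counts().index[0])
print(mode_wd.iloc[0])  # NE
```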
The dataset records the value of each pollutant for every hour of each day at the 12 stations. The following code finds the daily maximum of each pollutant at each station, along with the hour at which that maximum first occurs.
#define function: find the daily max of a pollutant and the corresponding hour at each station
def findmax(df, value):
    df2 = df.groupby(['station', 'Date'])[value].max()
    df4 = pd.merge(df[['station', 'Date', 'hour', value]], df2,
                   on=["Date", "station", value], how='inner')
    return df4
#apply function on each pollutant and rename max hour for each pollutant
df_SO2= findmax(df, 'SO2').rename(columns={"hour": "SO2 1st Max Hour"})
df_NO2 = findmax(df, 'NO2').rename(columns={"hour": "NO2 1st Max Hour"})
df_CO = findmax(df, 'CO').rename(columns={"hour": "CO 1st Max Hour"})
df_O3 = findmax(df, 'O3').rename(columns={"hour": "O3 1st Max Hour"})
#clean format
df_SO2= pd.DataFrame(df_SO2.groupby(['station', 'Date', 'SO2'])['SO2 1st Max Hour'].min()).reset_index()
df_O3 = pd.DataFrame(df_O3.groupby(['station', 'Date', 'O3'])['O3 1st Max Hour'].min()).reset_index()
df_CO = pd.DataFrame(df_CO.groupby(['station', 'Date', 'CO'])['CO 1st Max Hour'].min()).reset_index()
df_NO2 = pd.DataFrame(df_NO2.groupby(['station', 'Date', 'NO2'])['NO2 1st Max Hour'].min()).reset_index()
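An alternative sketch for the same first-max-hour logic uses idxmax, which returns the first row index at which the group maximum occurs (toy values below):

```python
import pandas as pd

toy = pd.DataFrame({'station': ['A'] * 4, 'Date': ['d1'] * 4,
                    'hour': [0, 1, 2, 3],
                    'SO2': [5, 9, 9, 3]})  # the max of 9 first appears at hour 1
idx = toy.groupby(['station', 'Date'])['SO2'].idxmax()  # first occurrence per group
first_max = toy.loc[idx, ['station', 'Date', 'hour', 'SO2']]
print(first_max['hour'].tolist())  # [1]
```

Because idxmax already keeps only the first occurrence, this avoids the extra min-over-hour cleanup step.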
2.1 Geographical Analysis
2.2 Data Preparation
2.3 Panel
We'll take a look at the location of each station and group them for better analysis.
#Get a list of station
Sta=list(pd.unique(df.station))
Sta
#station coordinates, looked up on Google
Longitude = [116.407, 116.23, 116.17, 116.434, 116.361, 116.225, 116.644, 116.473, 116.72, 116.434, 116.315, 116.366]
Latitude = [40.0031, 40.1952, 40.2865, 39.9522, 39.9425, 39.9279, 40.3937, 39.9716, 40.1438, 39.8745, 39.9934, 39.8673]
location = pd.DataFrame({'Station': Sta, 'Longitude': Longitude, 'Latitude': Latitude})
#visualize the location of each station
%matplotlib inline
import plotly.graph_objects as go
mapbox_access_token = 'pk.eyJ1IjoieHVlbGluZzExMDkiLCJhIjoiY2s5ZGF3M3RiMDF0ZDNnbnFya3EwMHloZiJ9.eKzDHTuUnEuQLyywpldF4w'
beijing_map_data = go.Scattermapbox(
    lon=location['Longitude'],
    lat=location['Latitude'],
    text=location['Station'],
    hoverinfo='text',
    mode='markers',
    marker=dict(
        color='red',
        symbol='circle',
        opacity=.5
    )
)
beijing_map_layout = go.Layout(
    title='Beijing Air Quality Stations',
    mapbox=go.layout.Mapbox(
        accesstoken=mapbox_access_token,
        zoom=2
    )
)
beijing_map = go.Figure(data=beijing_map_data, layout=beijing_map_layout)
beijing_map.show()
From the map, we can group the 12 stations into two regions: countryside and downtown.
countryside=['Dingling', 'Changping', 'Huairou', 'Shunyi']
downtown=list(set(Sta)-set(countryside))
After defining the regions, we can look at the distribution of each pollutant in the two regions; first, let's group the data by region.
data=ddf6.reset_index()
data['year'] = pd.DatetimeIndex(data['Date']).year
data['month'] = pd.DatetimeIndex(data['Date']).month
data['day'] = pd.DatetimeIndex(data['Date']).day
all_region = {'countryside':countryside, 'downtown':downtown}
data['Downtown'] = np.where(data['station'].isin(downtown), 1, 0)  #1 for downtown stations, 0 for countryside
data1 = data.set_index(keys=['station', 'Date']).sort_index(level=[0,1])
#get variable names
pollutant_keys1 = ['NO2', 'O3', 'SO2', 'CO']
pollutant_vals = ['Mean', 'Max']
all_indicators = []
for x1 in pollutant_keys1:
    for x2 in pollutant_vals:
        all_indicators.append(x1+x2)
identifier = ['station', 'Date']
time_feature = ['year', 'month', 'day']
data2 = data1[all_indicators].reset_index()
data2.head()
cols = [x+pollutant_vals[0] for x in pollutant_keys1]
dff0 = data[data['Downtown'] == 0][['Date'] + cols]
dff0.columns = ['Date'] + ['Countryside' +' '+x for x in cols]
dff1 = data[data['Downtown'] == 1][['Date'] + cols]
dff1.columns = ['Date'] + ['Downtown' +' '+x for x in cols]
dff = pd.merge(dff0,dff1,how='outer',on=['Date','Date'])
dff = dff.dropna().reset_index(drop=True)
dff.head()
Now we are ready to plot. Below are histograms of the daily mean of each pollutant for downtown and countryside stations.
DD = dff.copy()
for k in pollutant_keys1:
    key = k + pollutant_vals[0]
    plt.figure(figsize=(12,6))
    plt.title(key)
    DD['Countryside'+' '+key].hist(bins=100)
    DD['Downtown'+' '+key].hist(bins=100)
    plt.legend([x for x in all_region.keys()])
    plt.show()
Next, let's plot how the daily mean of each pollutant changes over time in the two regions.
def plot_test(dd, key, region):
    dd = dd[dd['station'].isin(region)]
    dd.index = dd['Date']
    dd[[x+pollutant_vals[0] for x in pollutant_keys1]].plot(title=str(key), figsize=(14, 6))
    plt.show()
for key, values in all_region.items():
    plot_test(data2, key, values)
Apparently, CO concentrations are far larger than those of the other pollutants, which makes the other series hard to see. We therefore remove CO to observe how the remaining pollutants change.
pollutant_keys2 = ['NO2', 'O3', 'SO2']
pollutant_vals = ['Mean', 'Max']
all_indicators2 = []
for x1 in pollutant_keys2:
    for x2 in pollutant_vals:
        all_indicators2.append(x1+x2)
identifier = ['station', 'Date']
time_feature = ['year', 'month', 'day']
data3 = data1[all_indicators2].reset_index()
def plot_test(dd, key, region):
    dd = dd[dd['station'].isin(region)]
    dd.index = dd['Date']
    dd[[x+pollutant_vals[0] for x in pollutant_keys2]].plot(title=str(key), figsize=(14, 6))
    plt.ylim(0, 300)
    plt.show()
for key, values in all_region.items():
    plot_test(data3, key, values)
3.1 Data Preparation for Models
3.2 Gradient Boosting vs. Naive Model
3.3 Deep Learning Neural Network
indic = all_indicators.copy()
indic.append('wd')
indic
def find_mean(data1, region):
    dff2 = data1[indic].reset_index()
    dff3 = dff2[dff2['station'].isin(region)]
    DF1 = dff3[['Date','NO2Mean','O3Mean','SO2Mean','COMean']].groupby('Date').mean()
    DF2 = dff3[['Date','NO2Max','O3Max','SO2Max','COMax']].groupby('Date').max()
    DF3 = dff3[['Date','wd']].groupby('Date').agg(lambda x: x.value_counts().index[0])
    DF4 = pd.merge(DF1, DF2, on='Date')
    DF5 = pd.merge(DF4, DF3, on='Date')
    return DF5.reset_index().sort_values('Date', ascending=False)
def create_new_variable(all_indicators, target_index, df):
    target = all_indicators[target_index]
    labels = df[['Date', target]].copy()
    labels[target] = df[target].shift(-1)  #tomorrow's value becomes today's label
    labels.columns = ['Date', 'label']
    #moving averages over four windows of 7, 28, 91, and 364 days,
    #roughly capturing weekly, monthly, quarterly, and yearly behavior
    windows = [7*i for i in [1, 4, 13, 52]]
    self_df = df[all_indicators].copy()
    self_df0 = self_df.rolling(window=windows[0]).mean()
    self_df1 = self_df.rolling(window=windows[1]).mean()
    self_df2 = self_df.rolling(window=windows[2]).mean()
    self_df3 = self_df.rolling(window=windows[3]).mean()
    self_df0.columns = [x+'_'+str(windows[0]) for x in all_indicators]
    self_df1.columns = [x+'_'+str(windows[1]) for x in all_indicators]
    self_df2.columns = [x+'_'+str(windows[2]) for x in all_indicators]
    self_df3.columns = [x+'_'+str(windows[3]) for x in all_indicators]
    df = pd.concat([df, self_df0, self_df1, self_df2, self_df3], axis=1)
    df = pd.merge(labels, df, on='Date', how='outer')
    df = df.dropna().reset_index(drop=True)
    df = pd.get_dummies(df, columns=['wd'])
    df = df.sort_values('Date')
    return df
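The two building blocks of create_new_variable, a next-day label via shift(-1) and trailing rolling means, behave like this on a toy series (window shortened to 2 for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
label = s.shift(-1)                 # tomorrow's value becomes today's label
roll2 = s.rolling(window=2).mean()  # trailing moving-average feature
print(label.tolist()[:3])  # [2.0, 3.0, 4.0] (the final label is NaN and gets dropped)
print(roll2.tolist()[1:])  # [1.5, 2.5, 3.5] (the first value is NaN: too few points)
```

The dropna() in the function removes exactly these NaN edge rows before modeling.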
Downtown:
m1=find_mean(data1,downtown)
m1=create_new_variable(all_indicators, 0, m1)
X = m1.drop(['Date','label'],axis=1)
y =m1['label']
train_begin = 0.6
dev_begin = 0.8
X_train = X[:int(train_begin*len(X))]
y_train = y[:int(train_begin*len(X))]
X_dev = X[int(train_begin*len(X)):int(dev_begin*len(X))]
y_dev = y[int(train_begin*len(X)):int(dev_begin*len(X))]
X_test = X[int(dev_begin*len(X)):]
y_test = y[int(dev_begin*len(X)):]
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
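One caveat with mean_squared_error: it returns the squared error, so reporting RMSE requires taking a square root. A quick check on hypothetical values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true, y_pred = [3.0, 5.0], [1.0, 5.0]
mse = mean_squared_error(y_true, y_pred)  # ((3-1)**2 + (5-5)**2) / 2 = 2.0
rmse = np.sqrt(mse)                       # sqrt(2) ~ 1.414
print(mse, round(rmse, 3))  # 2.0 1.414
```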
model = GradientBoostingRegressor(max_depth=6,n_estimators=200,learning_rate=0.01)
model.fit(X_train,y_train)
# get the order of feature importance
impos = dict(zip(list(X_train.columns),list(model.feature_importances_)))
impos = sorted(impos.items(),key=lambda x:x[1],reverse=True)
impos = [x[0] for x in impos]
# In order of feature importance, add the top-n features one at a time and
# use the validation-set RMSE to pick the best n.
rmses = []
for i in range(1, len(X_train.columns)):
    cols = impos[:i]
    X_train1 = X_train[cols].copy()
    X_dev1 = X_dev[cols].copy()
    model = GradientBoostingRegressor(max_depth=6, n_estimators=200, learning_rate=0.01)
    model.fit(X_train1, y_train)
    rmses.append(np.sqrt(mean_squared_error(y_dev, model.predict(X_dev1))))
    print(i, ' done')
plt.figure(figsize=(8,4))
plt.plot(list(range(1,len(X.columns))),rmses)
plt.legend(['rmse'])
plt.show()
print('best N is :',np.argmin(np.array(rmses))+1)
print('Top N features are:',impos[:np.argmin(np.array(rmses))+1])
cols = impos[:np.argmin(np.array(rmses))+1]
X_train1 = X_train[cols].copy()
X_test1 = X_test[cols].copy()
model = GradientBoostingRegressor(max_depth=6,n_estimators=200,learning_rate=0.01)
model.fit(X_train1,y_train)
preds = model.predict(X_test1)
print('model RMSE in test set is:', np.sqrt(mean_squared_error(y_test, preds)))
print('naive baseline (today predicts tomorrow) RMSE:', np.sqrt(mean_squared_error(list(y_test)[1:], list(y_test)[:-1])))
plt.figure(figsize=(16,6))
plt.grid()
plt.title('Gradient Boosting Downtown')
plt.xlim(100,200)
plt.plot(list(y_test))
plt.plot(preds)
plt.legend(['true','prediction'])
plt.show()
plt.figure(figsize=(16,6))
plt.grid()
plt.title('today predict tomorrow Downtown')
plt.xlim(100,200)
plt.plot(list(y_test)[1:])
plt.plot(list(y_test)[:-1])
plt.legend(['true','prediction'])
plt.show()
import tensorflow as tf
series = np.array(m1['label'])
time = np.array(m1['Date'])
# the series holds 1097 daily data points (about 3 years)
split_time = round(1097*0.8)
time_train = time[:split_time]
x_train = series[:split_time]
time_valid = time[split_time:]
x_valid = series[split_time:]
window_size = 30
batch_size = 32
shuffle_buffer_size = 1000
def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    series = tf.expand_dims(series, axis=-1)
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size+1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size+1))
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[1:]))  #inputs paired with one-step-shifted targets
    return ds.batch(batch_size).prefetch(1)
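The same windowing logic can be checked in plain NumPy (no shuffling or batching); this sketch only verifies the shapes, it is not the pipeline the model consumes:

```python
import numpy as np

def window_pairs(series, window_size):
    # each sample pairs window_size inputs with the same window shifted one step
    xs, ys = [], []
    for i in range(len(series) - window_size):
        chunk = series[i:i + window_size + 1]
        xs.append(chunk[:-1])
        ys.append(chunk[1:])
    return np.array(xs), np.array(ys)

xs, ys = window_pairs(np.arange(10, dtype=float), window_size=4)
print(xs.shape, ys.shape)  # (6, 4) (6, 4)
print(xs[0].tolist(), ys[0].tolist())  # [0.0, 1.0, 2.0, 3.0] [1.0, 2.0, 3.0, 4.0]
```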
tf.keras.backend.clear_session()
tf.random.set_seed(51)
np.random.seed(51)
window_size = 64
batch_size = 256
train_set = windowed_dataset(x_train, window_size, batch_size, shuffle_buffer_size)
print(train_set)
print(x_train.shape)
model2 = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                           strides=1, padding='causal',
                           activation="relu",
                           input_shape=[None, 1]),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
    tf.keras.layers.Lambda(lambda x: x*400)  #scale the output up to match the magnitude of the series
])
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
lambda epoch: 1e-8 * 10**(epoch / 20))
# lambda epoch: 0.001 * tf.math.exp(0.1 * (10 - epoch)))
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9)
model2.compile(loss=tf.keras.losses.Huber(),
optimizer=optimizer,
metrics=[tf.keras.metrics.RootMeanSquaredError()])
history = model2.fit(train_set, epochs=100, callbacks=[lr_schedule])
plt.semilogx(history.history["lr"], history.history["loss"])
plt.axis([1e-8, 1e-4, 0, 60])
tf.keras.backend.clear_session()
tf.random.set_seed(51)
np.random.seed(51)
train_set = windowed_dataset(x_train, window_size=60, batch_size=100, shuffle_buffer=shuffle_buffer_size)
model2 = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                           strides=1, padding='causal',
                           activation="relu",
                           input_shape=[None, 1]),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
    tf.keras.layers.Lambda(lambda x: x*400)  #scale the output up to match the magnitude of the series
])
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9)
model2.compile(loss=tf.keras.losses.Huber(),
optimizer=optimizer,
metrics=[tf.keras.metrics.RootMeanSquaredError()])
history = model2.fit(train_set,epochs=150)
def model_forecast(model, series, window_size):
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size))
    ds = ds.batch(32).prefetch(1)
    return model.predict(ds)
rnn_forecast = model_forecast(model2, series[..., np.newaxis], window_size)  #np.newaxis adds a dimension to series, which is the input shape from_tensor_slices requires
rnn_forecast = rnn_forecast[split_time - window_size:-1, -1, 0]
def plot_series(time, series, format="-", start=0, end=None):
    plt.plot(time[start:end], series[start:end], format)
    plt.xlabel("Time")
    plt.ylabel("Value")
    plt.grid(True)
plt.figure(figsize=(14, 6))
plot_series(time_valid, x_valid)
plot_series(time_valid, rnn_forecast)
tf.keras.metrics.mean_absolute_error(x_valid, rnn_forecast).numpy()
import shap
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test1)
# summarize the effects of all the features
shap.summary_plot(shap_values, X_test1)
shap.summary_plot(shap_values,X_test1,plot_type = "bar")
shap_interaction_values = explainer.shap_interaction_values(X_test1)
shap.summary_plot(shap_interaction_values, X_test1)
A SHAP value is a feature's average marginal contribution to the prediction across all possible orderings (permutations) of the features.
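To make that statement concrete, here is a tiny hand computation for a hypothetical two-feature model whose coalition values are v({}) = 0, v({1}) = 10, v({2}) = 20, v({1,2}) = 36:

```python
from itertools import permutations

# hypothetical coalition values for a 2-feature model
v = {frozenset(): 0.0, frozenset({1}): 10.0,
     frozenset({2}): 20.0, frozenset({1, 2}): 36.0}

def shapley(player, players):
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]  # marginal contribution in this ordering
    return total / len(orders)

print(shapley(1, [1, 2]), shapley(2, [1, 2]))  # 13.0 23.0 -- they sum to v({1,2}) = 36
```

The two values add up to the full model's output, which is the efficiency property that makes SHAP attributions interpretable.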
4.1 Daily Pattern
4.2 Monthly Pattern
4.3 Yearly Pattern
4.4 Relationship between TMAX(maximum temperature) and the pollutants
target1 = ['SO2 1st Max Hour']
dd = df_SO2[~df_SO2['station'].isin (countryside)].copy()
dd[target1].hist(bins=100,figsize=(14,6))
plt.legend(target1)
plt.show()
target2 = ['O3 1st Max Hour']
dd2 = df_O3[~df_O3['station'].isin (countryside)].copy()
dd2[target2].hist(bins=100,figsize=(14,6))
plt.legend(target2)
plt.show()
target3 = ['CO 1st Max Hour']
dd3 = df_CO[~df_CO['station'].isin (countryside)].copy()
dd3[target3].hist(bins=100,figsize=(14,6))
plt.legend(target3)
plt.show()
target4 = ['NO2 1st Max Hour']
dd4 = df_NO2[~df_NO2['station'].isin(countryside)].copy()
dd4[target4].hist(bins=100,figsize=(14,6))
plt.legend(target4)
plt.show()
keys = ['PM2.5','PM10', 'SO2', 'NO2', 'CO', 'O3']
functions = ['Mean', 'Max']
def plot_month_box(df, target):
    all_months = sorted(set(df['month'].astype(int)))
    ddd = pd.DataFrame(columns=all_months)
    for y in all_months:
        temp = df[df['month'] == float(y)].reset_index()
        ddd[y] = temp[target]
    ddd.plot.box(figsize=(14,6), title=target)
for key in keys:
    df1 = data[~data['station'].isin(countryside)].copy()
    target = key + functions[0]
    plot_month_box(df1, target)
    plt.title('month downtown ' + target)
def plot_year_box(df, target):
    all_years = sorted(set(df['year'].astype(int)))
    ddd = pd.DataFrame(columns=all_years)
    for y in all_years:
        temp = df[df['year'] == float(y)].reset_index()
        ddd[y] = temp[target]
    ddd.plot.box(figsize=(14,6), title=target)
for key in keys:
    df1 = data[~data['station'].isin(countryside)].copy()
    target = key + functions[0]
    plot_year_box(df1, target)
    plt.title('year downtown ' + target)
targets=['PM2.5Mean', 'PM10Mean', 'SO2Mean', 'NO2Mean', 'COMean', 'O3Mean']
def plot_temp_target(target):
    dd = data[~data['station'].isin(countryside)].copy()
    dd.index = dd['Date']
    dd = dd[[target, 'TEMPMean']].copy()
    dd = (dd - dd.mean()) / dd.std()    #standardize so both series share a scale
    dd = dd.rolling(window=100).mean()  #100-day rolling mean to expose the trend
    dd.plot(figsize=(14,6), grid=True)
for target in targets:
    plot_temp_target(target)
plt.figure(figsize=(14, 6))
plot_series(time, series)
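The standardize-then-smooth transform used for the temperature comparison can be checked on a toy series (rolling window shortened to 2; values hypothetical):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
z = (s - s.mean()) / s.std()         # z-score: mean 0, std 1, so series share a scale
smooth = z.rolling(window=2).mean()  # short stand-in for the 100-day window
print(round(float(z.mean()), 6), round(float(z.std()), 6))  # 0.0 1.0
```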